Data Description:

The file Loan_Modelling.csv contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (9%) accepted the personal loan that was offered to them in the earlier campaign.

Domain:

Banking

Context:

This case is about AllLife Bank, whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that AllLife Bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

Learning Outcomes:
  1. Exploratory Data Analysis
  2. Data Cleaning
  3. Data Visualization
  4. Preparing the data to train a model
  5. Training and making predictions using a classification model
  6. Model evaluation
Objective:

The classification goal is to predict the likelihood of a liability customer buying a personal loan; that is, to build a model that predicts which customers are most likely to accept a personal loan offer, based on their relationship with the bank across the features in the dataset. Here I will use supervised learning methods and compare Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes to find the best model for this problem.

Import the necessary libraries :

Comment: Here I have used numpy, pandas, matplotlib, seaborn, and scipy for EDA and data visualization, and sklearn for data splitting, model building, and the confusion matrix.

Exploratory Data Analysis

Read the data as a data frame :-

Comment: Here I have read the Personal Loan dataset using the read_csv() function of pandas; df is a dataframe. I have used the head() function to display the first 5 records of the dataset.

Target column rearrangement:- Since our target column (Personal Loan) is in the middle of the dataframe, for convenience I dropped it from its original position and appended it at the end of the dataframe.
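The rearrangement described above can be sketched as follows (a minimal toy frame stands in for the real data; only a few columns are shown):

```python
import pandas as pd

# Toy frame standing in for the loan data; only a few columns shown
df = pd.DataFrame({"ID": [1, 2], "Personal_Loan": [0, 1], "Income": [49, 34]})

# Drop the target from its original position and append it as the last column
target = df.pop("Personal_Loan")
df["Personal_Loan"] = target
```

df.pop removes the column in place and returns it, so a single assignment puts it back at the end.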

Understanding the features (attributes) from the above dataframe :-

  1. The ID variable can be ignored as it has no effect on our model; the customer ID merely keeps the records in serial order and has no relationship with the loan decision.
  2. The target variable is Personal Loan, which describes whether the person has taken a loan or not. This is the variable we need to predict.

Nominal Variables :

  1. ID - Customer ID
  2. ZIP Code - Home address ZIP code of the customer. This variable can also be ignored because we cannot judge customers based on their area or location.

Ordinal Categorical variables :

  1. Family - Number of family members of the customer
  2. Education - Education level of the customer. In our dataset it ranges from 1 to 3, which are Under Graduate, Graduate, and Post Graduate respectively.

Interval Variables :

  1. Age - Age of the customer
  2. Experience - Years of professional experience the customer has
  3. Income - Annual income of the customer, in dollars
  4. CCAvg - Average spending on credit cards per month, in dollars
  5. Mortgage - Value of the house mortgage

Binary Categorical Variable :

  1. CD Account - Does the customer have a CD account with the bank or not?
  2. Securities Account - Does the customer have a securities account with the bank or not?
  3. Online - Does the customer use the bank's online banking facility or not?
  4. CreditCard - Does the customer have a credit card issued by the bank or not?
  5. Personal Loan - Our target variable, which we have to predict. It indicates whether the customer has taken a loan or not.
Shape of the data :-

Comment: Shape of the dataframe is (5000, 14). There are 5000 rows and 14 columns in the dataset.

Data type of each attribute :-

Comment: We can also display the data types of the dataframe using the df.info() function, which gives even more useful information.

Comment: Here we can see that all the variables are numerical. However, the columns 'CD Account', 'Online', 'Family', 'Education', 'CreditCard', and 'Securities Account' are categorical variables that should be of 'category' type.

Data Preprocessing

Processing Zipcode

Zipcode is a categorical feature and can be a good predictor of the target variable. We can analyse whether there is any location pattern among customers who borrowed a loan during the previous campaign, and try to reduce the number of categories.

We mapped almost all ZIP codes to a county except 96651, 92634, 93077, and 92717. Some of these could be fixed by searching the internet; the rest could not be found.

df.info()

Fixing the data types

Personal_Loan, Securities_Account, CD_Account, Online, CreditCard, and Education are of int/object type; we can change them to category type to reduce the memory required.

We can see that the memory usage has decreased from 547.0 KB to 305.4 KB.
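A minimal sketch of the dtype conversion described above (toy data; on the full 5000-row frame this conversion is what produces the memory savings):

```python
import pandas as pd

# Toy frame with two low-cardinality columns and one truly numeric column
df = pd.DataFrame({
    "Education": [1, 2, 3, 1],
    "CD_Account": [0, 1, 0, 0],
    "Income": [49, 34, 11, 100],
})

# Cast the low-cardinality columns to the 'category' dtype
for col in ["Education", "CD_Account"]:
    df[col] = df[col].astype("category")
```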

Processing Experience
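The notebook cell for this step is not shown. A common fix, assuming the negative experience values noted later in the EDA are sign errors, is to take absolute values (the column is ultimately dropped anyway because of its correlation with Age):

```python
import pandas as pd

# Hypothetical experience values, including negatives like those found in the data
experience = pd.Series([3, -1, 10, -2, 25])

# Treat the sign as a data-entry error and take absolute values
experience_fixed = experience.abs()
```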

Exploratory Data Analysis

Observations

  1. Customers' ages range from 23 to 67, with mean and median of ~45.
  2. Maximum experience is 43 years, whereas the mean and median are ~20.
  3. Incomes range from 8k to 224k USD. The mean is 73k USD and the median is 64k USD; the 224k maximum needs to be verified.
  4. The maximum mortgage taken is 635k USD; this also needs to be verified.
  5. Average spending on credit cards per month ranges from 1 to 10k, with a mean of 1.9k USD and a median of 1.5k USD.
  6. 1095 customers are from Los Angeles County.
  7. 480 customers had borrowed a loan before.

Univariate Analysis

Observations

Age and experience have the same distribution, with spikes at five points. Income is right-skewed and has some outliers on the higher side that can be clipped. Average monthly credit card spending is right-skewed and has many outliers on the higher side that can be clipped. Mortgage is mostly 0, but it is right-skewed and has many high-side outliers that can be clipped.

Age

Age can be a vital factor in borrowing a loan, so we convert ages to bins to explore whether there is any pattern.
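Binning can be done with pd.cut; the bin edges below are illustrative, not necessarily the notebook's exact choice:

```python
import pandas as pd

ages = pd.Series([23, 35, 47, 58, 67])

# Decade-style bins spanning the observed 23-67 age range (edges are assumptions)
age_bin = pd.cut(
    ages,
    bins=[18, 30, 40, 50, 60, 70],
    labels=["<=30", "31-40", "41-50", "51-60", "61+"],
)
```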

Income

To understand customer segments, we derive a new column that identifies whether a customer belongs to the upper, middle, or lower income group.

Spending

To understand customer spending, we derive a new column that identifies whether a customer belongs to the upper, middle, or lower spending group.
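One way to derive both group columns is quantile-based binning with pd.qcut; terciles are an assumption here, and the notebook may instead use hand-picked thresholds:

```python
import pandas as pd

df = pd.DataFrame({
    "Income": [8, 40, 64, 90, 224],      # illustrative annual incomes
    "CCAvg": [0.3, 1.0, 1.5, 2.5, 10.0]  # illustrative monthly card spending
})

# Tercile-based income and spending groups
df["Income_group"] = pd.qcut(df["Income"], q=3, labels=["Low", "Middle", "Upper"])
df["Spending_group"] = pd.qcut(df["CCAvg"], q=3, labels=["Low", "Medium", "High"])
```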

Observations

  1. ~29.4% of customers are single.
  2. ~41.9% of customers are undergrads.
  3. ~9.6% bought a personal loan from the bank.
  4. 10.4% of customers have a securities account with the bank.
  5. 6% of customers have a CD account.
  6. 60% of customers transact online.
  7. 29.4% of customers have credit cards.
  8. ~75% of customers are in the 31-60 age range.
  9. ~50% of the bank's customers belong to the middle income group.
  10. ~48% of customers have medium average spending.

It can be seen that the percentage of loans taken differs across counties. Since there are so many counties, converting them to regions will help our model.

Converting the counties to regions based on https://www.calbhbc.org/region-map-and-listing.html
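The mapping itself can be a simple dictionary lookup; the county names and region labels below are an illustrative subset, while the full mapping follows the linked listing:

```python
import pandas as pd

# Illustrative county -> region lookup (subset; the real mapping covers all counties)
county_to_region = {
    "Los Angeles": "Los Angeles Region",
    "Alameda": "Bay Area",
    "Sacramento": "Central",
}

counties = pd.Series(["Los Angeles", "Alameda", "Sacramento", "Los Angeles"])
regions = counties.map(county_to_region)
```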

Bivariate & Multivariate Analysis

Observations

As expected, age and experience are highly correlated, so one of them can be dropped; since experience has invalid values to handle, we will drop experience. Income and average credit card spending are positively correlated. Mortgage has very little correlation with income.

Observations

People with higher income had opted for a personal loan before.

People with high mortgages opted for a loan.

Customers with higher average monthly credit card usage have opted for a loan.

Customers with higher income have higher average credit card usage and mortgages.

Customers with Graduate and Advanced/Professional education have higher monthly credit card usage and have borrowed loans from the bank.

Observations

  1. More customers with a family size of 3 borrowed loans from the bank than any other family size.
  2. 60 of those who had a personal loan with the bank also had a securities account.
  3. Customers who had a certificate of deposit with the bank had previously borrowed a loan.
  4. Whether customers use online facilities has no impact on personal loans.
  5. The majority of customers who had a personal loan with the bank did not use a credit card from other banks.
  6. The majority of customers who had taken a personal loan before are from the Los Angeles and Bay regions.
  7. The ratio of borrowing a loan is high among customers aged 30 and below and 60 and above.
  8. Customers with high average monthly spending have bought a personal loan before.
  9. As expected, age and experience are highly correlated and one of them can be dropped; since experience had negative values, dropping experience is the better option.

Check the distribution of the target column

The target variable personal_loan is highly imbalanced: only 9.6% of the customers in the dataset have previously opted for a personal loan. This can be handled using class weights or SMOTE, but for now we will carry on without SMOTE.

Insights based on EDA

Summary of EDA

Data Description:

  1. The dependent variable is Personal_Loan, which is of categorical data type.
  2. Age, Experience, Income, Mortgage, and CCAvg are numeric, while the other variables are categorical.
  3. There were no missing values in the dataset.

Data Cleaning:

  1. We observed some observations where experience was negative, but since experience has a strong correlation with age, we dropped it.
  2. There are 450 unique ZIP codes; we mapped them to counties, which were further mapped to regions to reduce the dimensionality of the data, leaving only 5 distinct values.
  3. We also created an age bin, a spending group, and an income group to analyse whether there is any pattern in buying a loan based on these.

Observations from EDA:

  1. People with higher income had opted for a personal loan before.
  2. People with high mortgages opted for a loan.
  3. Customers with higher average monthly credit card usage have opted for a loan.
  4. Customers with families of 3 members had borrowed loans from the bank.
  5. Customers with education level 2 (Graduate) and 3 (Advanced/Professional) have borrowed loans from the bank.
  6. Customers who had a certificate of deposit with the bank had previously borrowed a loan.
  7. The majority of customers who had a personal loan with the bank used online facilities.
  8. The majority of customers who had taken a personal loan before are from the Los Angeles region.
  9. The ratio of borrowing a loan is high among customers aged 30 and below and 60 and above.
  10. The more income you earn, the more you spend: a "larger than life" lifestyle.
  11. Customer segmentation for borrowing a loan, based on the EDA:
  12. Customers with higher income have higher mortgages and higher monthly average spending. They also hold a certificate of deposit with the bank. They are our high-profile clients.
  13. A few customers in the medium income group don't have high mortgages and have lower average monthly credit card spending. They are average-profile clients.
  14. Customers in the lower income group have lower mortgages (with a few outliers) and less monthly spending. They are our low-profile clients.

Actions for data pre-processing:

  1. Many variables have outliers that need to be treated.
  2. We can drop Experience, Country, Zipcode, Agebin, Income_group, and Spending_group.

Outliers detection:

There are some really extreme values in Income (224k USD) compared to the same age group and experience. The values for credit card spending and mortgages look fine. After identifying outliers, we can decide whether to remove/treat them or not. Here I am not going to treat them, as there will be outliers in real-world scenarios (in income, mortgage value, average credit card spending, etc.), and we want our model to learn the underlying pattern for such customers.

We have 6 categorical independent variables, but 4 of them are binary, so we would get the same results with them even after creating dummies. Therefore we will only create dummies for Regions and Education.
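A sketch of this encoding step with pd.get_dummies; drop_first avoids the dummy-variable trap, and the binary flags are left untouched:

```python
import pandas as pd

df = pd.DataFrame({
    "Education": [1, 2, 3],
    "Regions": ["Bay Area", "Central", "Bay Area"],
    "Online": [1, 0, 1],  # binary flag, no dummies needed
})

# Dummy-encode only the multi-level categoricals
df = pd.get_dummies(df, columns=["Education", "Regions"], drop_first=True)
```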

Model building Logistic Regression

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a person will buy a loan when he actually doesn't (loss of resources).
  2. Predicting a person will not buy a loan when he actually does (loss of opportunity).

Which case is more important?

  1. The whole purpose of the campaign is to bring in more customers, so the 2nd case is more important to us: a potential customer is missed by the sales/marketing team, which is a loss of opportunity. We want to minimize this loss.

How to reduce losses, i.e., reduce false negatives?

  1. In this case, not being able to identify a potential customer is the biggest loss we can face. Hence, recall is the right metric to check the performance of the model. The bank wants recall to be maximized: the greater the recall, the lower the chance of false negatives.
  2. We could use accuracy, but since the data is imbalanced it would not be the right metric to check model performance.
  3. Therefore, recall should be maximized: the greater the recall, the fewer potential customers we miss.
Logistic Regression (with Sklearn library)

add_score_model(scores_Sklearn)

Logistic Regression (with Statsmodels)

Test Assumption

MultiCollinearity

We will have to check for and remove multicollinearity from the data to get reliable coefficients and p-values. There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF). General rule of thumb: if VIF is 1, there is no correlation between the predictor and the remaining predictor variables, whereas if VIF exceeds 5, it shows signs of high multicollinearity. The purpose of the analysis should dictate which threshold to use.
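VIF is usually computed with statsmodels' variance_inflation_factor; the self-contained sketch below instead implements the definition directly (VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the others), so the mechanics are visible:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Design matrix: intercept plus all other columns
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2) if r2 < 1 else float("inf"))
    return out

# Demo: an independent pair gives VIF near 1; a collinear pair gives a large VIF
rng = np.random.default_rng(0)
x = rng.normal(size=300)
vifs_low = vif(np.column_stack([x, rng.normal(size=300)]))
vifs_high = vif(np.column_stack([x, x + 0.01 * rng.normal(size=300)]))
```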

Observations: There is no correlation between the predictor variables.

In this case, all the 'Regions' attributes have high p-values, which means they are not significant; therefore we can drop the variable entirely.

The p-value for Mortgage is 0.264, so we drop Mortgage.

Dropping Age, as its p-value is greater than 0.05.

CCAvg is an important parameter as per the EDA, so we are not dropping it.

ROC-AUC curve

ROC-AUC curve on train data

ROC-AUC curve on test data

The Logistic Regression model gives a generalized performance on the training and test sets. A ROC-AUC score of 0.96 on both the training and test sets is quite good.

Coefficient interpretations

  1. The coefficients of Income, Education, Family, CCAvg, CD Account, and Age are positive, i.e., a one-unit increase in these increases the chances of a person borrowing a loan.
  2. The coefficients of Securities Account, Online, and Credit Card are negative; an increase in these decreases the chances of a person borrowing a loan.

Converting coefficients to odds

  1. The coefficients of the logistic regression model are in terms of log(odds); to find the odds, we take the exponential of the coefficients.
  2. Therefore, odds = exp(b).
  3. The probability can be calculated from the odds using probability = odds / (1 + odds).
  4. The percentage change in odds is given by (exp(b) - 1) * 100.
  1. Income: Holding all other features constant, a 1-unit change in Income will increase the odds of a customer taking a personal loan by 20 times, i.e., a 95% probability of taking a personal loan.
  2. Family: Holding all other features constant, a 1-unit change in Family will increase the odds of a customer taking a personal loan by 2.16 times.
  3. CCAvg: Holding all other features constant, a 1-unit change in CCAvg will increase the odds of a customer taking a personal loan by 1.22 times, i.e., a 22.16% increase in the odds.
  4. Customers with Advanced education have 7 times higher odds of taking a personal loan than undergraduates. The interpretation for the other attributes can be done similarly.
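The formulas above can be sketched numerically; the coefficient value here is illustrative, not the fitted model's:

```python
import numpy as np

b = 0.04  # hypothetical log-odds coefficient for one feature

odds_ratio = np.exp(b)                # multiplicative change in odds per unit
pct_change = (np.exp(b) - 1) * 100    # percentage change in odds
prob = odds_ratio / (1 + odds_ratio)  # odds converted back to a probability
```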

Overall, the most significant variables are Income, Education, CD Account, Family, and CCAvg.

Model performance evaluation and improvement

Insights:

True Positives:

Reality: A customer wanted to take a personal loan. Model prediction: the customer will take a personal loan. Outcome: the model is correct.

True Negatives:

Reality: A customer didn't want to take a personal loan. Model prediction: the customer will not take a personal loan. Outcome: the business is unaffected.

False Positives:

Reality: A customer didn't want to take a personal loan. Model prediction: the customer will take a personal loan. Outcome: the team targeting potential customers would waste resources on customers who will not buy a personal loan.

False Negatives:

Reality: A customer wanted to take a personal loan. Model prediction: the customer will not take a personal loan. Outcome: the potential customer is missed by the sales team. This is a loss of opportunity: the purpose of the campaign was to target such customers, and if the team had known about them, they could have offered good APR/interest rates.

Right Metric to use:

Here, not being able to identify a potential customer is the biggest loss we can face. Hence, recall is the right metric to check the performance of the model. We have a recall of 68 on train and 67 on test; false negatives are 107 and 47 on train and test respectively. We can further improve this score using the optimal threshold from the ROC-AUC curve and the precision-recall curve.

Optimal threshold using AUC-ROC curve

With a 0.092 threshold, the recall score has improved from 68% to 87% on test data, with 89% accuracy. False negatives have also decreased from 46 to 18 on the test data. The ROC-AUC score is 88, which is good.
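The threshold search can be sketched with roc_curve and Youden's J statistic (maximizing TPR − FPR); synthetic imbalanced data stands in for the loan set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic ~90/10 imbalanced data as a stand-in for the loan dataset
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, probs)
j = np.argmax(tpr - fpr)        # index of the best TPR - FPR trade-off
best_threshold = thresholds[j]
```

On imbalanced data this optimal threshold typically lands well below the default 0.5, which is why recall improves.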

Let's use Precision-Recall curve and see if we can find a better threshold

With this model the false negative cases have gone up; recall on test is 72 with 95% accuracy. The model performs well on both the training and test sets. It gives a balanced performance, so if the bank wishes to maintain a balance between recall and precision, this model can be used. The area under the curve has decreased compared to the initial model, but the performance is generalized across the training and test sets.

Using Sequential Feature Selection

Sequential forward selection log (condensed from the verbose per-step output):

Run 1 (selecting up to 16 features): score 0.9941 with 1 feature; plateau at 0.9971 from 2 through 10 features; then declining: 11 features 0.9941, 12: 0.9911, 13: 0.9881, 14: 0.9762, 15: 0.9791, 16: 0.6725.

Run 2 (selecting up to 11 features): score 0.9941 with 1 feature; plateau at 0.9971 from 2 through 10 features; 0.9941 with all 11.

[1, 2, 4, 5, 6, 8, 9, 10, 11, 13, 14]

Index(['Age', 'Income', 'CCAvg', 'Mortgage', 'SecuritiesAccount', 'Online', 'CreditCard', 'Regions_Central', 'Regions_Los Angeles Region', 'Regions_Superior', 'Education_2'], dtype='object')

Now we will fit a sklearn model using these features only

MODEL PERFORMANCE

Accuracy : Train 0.677, Test 0.686
Recall   : Train 0.997, Test 0.972
Precision: Train 0.229, Test 0.231
F1       : Train 0.372, Test 0.373


Model                                            Train_Acc  Test_Acc  Train_Recall  Test_Recall  Train_Prec  Test_Prec  Train_F1  Test_F1
0 Logistic Regression - Sklearn                  0.66       0.65      0.98          0.99         0.22        0.21       0.35      0.35
1 Logistic Regression - Statsmodels              0.96       0.96      0.68          0.67         0.86        0.84       0.76      0.75
2 Logistic Regression - threshold = 0.092        0.90       0.90      0.90          0.88         0.49        0.48       0.63      0.62
3 Logistic Regression - threshold = 0.3          0.95       0.94      0.80          0.73         0.75        0.70       0.77      0.72
4 Logistic Regression - seq. feature selection   0.68       0.69      1.00          0.97         0.23        0.23       0.37      0.37

Model building Decision Tree

  1. Data preparation.
  2. Partition the data into train and test sets.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.
  5. Test the model on the test set.

Build Model

  1. We are using the 'gini' criterion to split.
  2. If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree becomes biased toward it.

  3. To handle this imbalanced dataset, we can pass a dictionary {0:0.15, 1:0.85} to the model to specify the weight of each class, so the decision tree gives more weight to class 1.

  4. class_weight is a hyperparameter of the decision tree classifier.

  5. As with logistic regression, not being able to identify a potential customer is the biggest loss, so recall is the right metric to check the performance of the model.
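The class-weighting described above can be sketched as follows (synthetic imbalanced data stands in for the loan set):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic ~90/10 imbalanced data as a stand-in for the loan dataset
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Give the minority class (loan takers) a much larger weight
tree = DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
tree.fit(X_tr, y_tr)
test_recall = recall_score(y_te, tree.predict(X_te))
```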

Decision trees tend to overfit, and the disparity between the recall on train and test suggests that this model is overfitted.

Visualizing the Decision Tree

Using GridSearch for Hyperparameter tuning of our tree model

  1. Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  2. It is an exhaustive search performed over the specified parameter values of a model.
  3. The parameters of the estimator/model are optimized by cross-validated grid search over a parameter grid.
  4. Let's see if we can improve our model performance even more.
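A sketch of the grid search, scoring on recall as argued earlier; the grid values are illustrative (the notebook's tuned values were max_depth=6, max_leaf_nodes=20, min_samples_leaf=7):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data as a stand-in for the loan dataset
X, y = make_classification(n_samples=600, weights=[0.9], random_state=1)

param_grid = {
    "max_depth": [4, 6, 8],
    "max_leaf_nodes": [10, 20],
    "min_samples_leaf": [5, 7],
}

# Exhaustive cross-validated search over the grid, optimizing recall
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring="recall", cv=3)
grid.fit(X, y)
best_params = grid.best_params_
```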

Observations

  1. With hyperparameters max_depth=6, max_leaf_nodes=20, min_samples_leaf=7, the overfitting on train has reduced, but the recall on test has not improved.
  2. The important features are Income, Education 2 and Education 3, Family 4, Family 3, CCAvg, and Age.
  3. However, the recall metric is still 91 and there are 12 false negatives. We don't want to lose the opportunity of predicting these customers, so let's see whether post-pruning, instead of pre-pruning, helps reduce the false negatives.

Cost Complexity Pruning

Next, we train decision trees using the effective alphas: we pass each value of alpha to the ccp_alpha parameter of our DecisionTreeClassifier and, looping over the alphas array, record the accuracy on both the train and test parts of our dataset.
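The loop described above can be sketched as follows (synthetic data here; the notebook runs this on the loan train/test split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data as a stand-in for the loan dataset
X, y = make_classification(n_samples=600, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Effective alphas from the cost-complexity pruning path on the training data
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)

# One pruned tree per alpha, recording train and test accuracy
scores = []
for a in path.ccp_alphas:
    # max() guards against tiny negative alphas from floating-point round-off
    t = DecisionTreeClassifier(random_state=1, ccp_alpha=max(a, 0.0))
    t.fit(X_tr, y_tr)
    scores.append((a, t.score(X_tr, y_tr), t.score(X_te, y_te)))
```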

We get a higher recall on test data for alphas between 0.002 and 0.005; we will choose alpha = 0.002.

Creating a model with ccp_alpha = 0.002

The recall on train and test indicates that we have created a generalized model, with 96% accuracy and reduced false negatives.

Important features: Income, graduate education, family sizes 3 and 4, CCAvg, advanced education, and Age. This is the best model, as there are only 6 false negatives on the test data.

Comparing all the models based on Model Performance

The post-pruned decision tree model has given us the best recall scores, with 97% accuracy. Exploratory data analysis also suggested income and education were important features in deciding whether a person will borrow a personal loan, so we choose the post-pruned decision tree for our predictions.

Observation

After post-pruning, the false negatives have reduced to 6. The accuracy on test data is 97% and recall is 97% after choosing the optimal ccp-alpha.

Conclusion

  1. We analyzed the personal loan campaign data using EDA and built models (Logistic Regression and a Decision Tree classifier) to predict the likelihood of a customer buying a loan.
  2. First we built a Logistic Regression model; the performance metric used was recall. The most important features for classification were Income, Education, CD Account, Family, and CCAvg.
  3. The coefficients of Income, Graduate and Advanced education, Family_3, Family_4, CCAvg, CD Account, and Age are positive, i.e., a one-unit increase in these increases the chances of a person borrowing a loan.
  4. The coefficients of Securities Account, Online, Family_2, and Credit Card are negative; an increase in these decreases the chances of a person borrowing a loan.
  5. We also improved the performance using the ROC-AUC curve and an optimal threshold. This was the best model, with high recall and accuracy.
  6. Decision trees can easily overfit. They require less data preprocessing than logistic regression and are easy to understand.
  7. We used decision trees with pre-pruning and post-pruning. The post-pruned model gave 96% recall with 97% accuracy.
  8. Income, a graduate degree, and having 3 family members are some of the most important variables in predicting whether a customer will purchase a personal loan.

Actionable Insights & Recommendations

Misclassification Analysis

The values predicted by our model are very close to the actual values. Let's examine the false negative and false positive observations.

Our model predicted 6 customers wrongly. On analyzing Income, Education, and Family, we can see that for most of them the income is not in the high-income range, the education level is undergrad, and their average spending is also low. These cases are exceptions.

On analyzing Education, we can see that most of them have Advanced or Graduate education. These cases are exceptions.

Recommendation

  1. Decision trees don't require much data preparation or handling of outliers, unlike logistic regression, and they are easy to understand. However, decision trees can easily overfit, so we have to be careful when using them.
  2. Based on the EDA, logistic regression, and the decision tree, Income, Education, Family, and CCAvg are the most important factors.
  3. Customers who have an income above 98k dollars, Advanced/Graduate-level education, and a family of more than 2 have higher chances of taking a personal loan.
  4. So for this campaign we can define different customer profiles.
  5. High-profile clients: higher income, Advanced/Graduate-level education, 3/4 family members, high spending.
  6. Average-profile clients: medium income group, Graduate-level education, 3/4 family members, medium spending.
  7. Low-profile clients: lower income group, undergrads, 3/4 family members, low spending.
  8. Customers' average spending and mortgages can also be considered, as the EDA and logistic regression show these parameters also play some role in the likelihood of buying a loan.
  9. We can first target high-profile customers by providing them with personal relationship managers who can address their concerns and persuade them to take a loan from the bank at competitive interest rates.
  10. Pre-qualifying for a loan can also attract more customers.
  11. Our second target would be average-profile customers.
  12. The model cannot identify well the exceptional cases where a low-profile customer is ready to buy a personal loan.